58 research outputs found

    Establishing a New State-of-the-Art for French Named Entity Recognition

    The French TreeBank developed at the University Paris 7 is the main source of morphosyntactic and syntactic annotations for French. However, it does not include explicit information related to named entities, which are among the most useful pieces of information for many natural language processing tasks and applications. Moreover, no large-scale French corpus with named entity annotations contains referential information, which complements the type and the span of each mention with an indication of the entity it refers to. We have manually annotated the French TreeBank with such information, after an automatic pre-annotation step. We sketch the underlying annotation guidelines and provide a few figures about the resulting annotations.
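    The referential annotation described here pairs each mention's type and span with an identifier for the entity it denotes. A minimal sketch of such a record (the field names, label set and Wikidata-style IDs are our own illustration, not the corpus's actual format):

```python
from dataclasses import dataclass

@dataclass
class EntityMention:
    label: str      # entity type, e.g. "LOC" or "PER"
    start: int      # first token of the mention (inclusive)
    end: int        # token index just past the mention (exclusive)
    entity_id: str  # referential information: which entity the mention denotes

tokens = ["Le", "Louvre", "est", "à", "Paris", "."]
mentions = [
    EntityMention("LOC", 1, 2, "Q19675"),  # the Louvre museum
    EntityMention("LOC", 4, 5, "Q90"),     # the city of Paris
]

# Recover each mention's surface form from its span.
surface_forms = [tokens[m.start:m.end] for m in mentions]
```

    Two mentions of the same entity would share an `entity_id`, which is what distinguishes referential annotation from plain type-and-span annotation.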

    Combination of Janeng Starch-Chitosan with the Natural Preservatives Turmeric and Ascorbic Acid as an Edible Coating

    Edible coatings are one current solution for food preservation. In this study, an edible coating was made from a composite edible film of janeng starch, chitosan and natural preservatives (turmeric and ascorbic acid), and applied to meatballs and cheese. The best composite edible film compositions, starch:chitosan:turmeric (1.2% : 0.4% : 0.375%) and starch:chitosan:ascorbic acid (1.2% : 0.4% : 0.5%), selected on the basis of tensile strength, elongation and colour tests, were used for the edible coating application. Antimicrobial tests showed that edible films combined with turmeric or ascorbic acid inhibited the growth of E. coli bacteria with larger inhibition-zone diameters than the plain janeng starch-chitosan edible film, namely 7 mm each. Coating cheese samples with the edible coating reduced microbial growth, inhibited fat oxidation by up to 50% and limited the increase in moisture content to 41.17% over 3 months of storage compared with uncoated cheese, while coating meatball samples limited the increase in moisture content to 20.76%; sensory analysis of coated meatballs in terms of aroma, texture and colour suggested that coated meatballs were better than uncoated ones after 3 days of storage. Keywords: edible coating, janeng starch, chitosan, turmeric, ascorbic acid, antimicrobial

    A Monolingual Approach to Contextualized Word Embeddings for Mid-Resource Languages

    We use the multilingual OSCAR corpus, extracted from Common Crawl via language classification, filtering and cleaning, to train monolingual contextualized word embeddings (ELMo) for five mid-resource languages. We then compare the performance of OSCAR-based and Wikipedia-based ELMo embeddings for these languages on part-of-speech tagging and parsing tasks. We show that, despite the noise in the Common-Crawl-based OSCAR data, embeddings trained on OSCAR perform much better than monolingual embeddings trained on Wikipedia: they equal or improve the current state of the art in tagging and parsing for all five languages. In particular, they improve over multilingual Wikipedia-based contextual embeddings (multilingual BERT), which almost always constitute the previous state of the art, showing that the benefit of a larger, more diverse corpus outweighs the cross-lingual benefit of multilingual embedding architectures.
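    The downstream comparison reduces to scoring each embedding variant's tagger output against gold tags. A toy sketch of the token-level POS-tagging accuracy metric (the tag sequences below are invented for illustration, not the paper's data):

```python
def tagging_accuracy(gold, pred):
    """Fraction of tokens whose predicted POS tag matches the gold tag."""
    assert len(gold) == len(pred)
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

gold       = ["DET", "NOUN", "VERB", "DET", "NOUN"]
pred_wiki  = ["DET", "NOUN", "NOUN", "DET", "NOUN"]  # hypothetical Wikipedia-ELMo tagger
pred_oscar = ["DET", "NOUN", "VERB", "DET", "NOUN"]  # hypothetical OSCAR-ELMo tagger

acc_wiki = tagging_accuracy(gold, pred_wiki)
acc_oscar = tagging_accuracy(gold, pred_oscar)
```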

    SinNer@Clef-Hipe2020 : Sinful adaptation of SotA models for Named Entity Recognition in French and German

    In this article we present the approaches developed by the Sorbonne-INRIA for NER (SinNer) team for the CLEF-HIPE 2020 challenge on Named Entity Processing in old newspapers. The challenge proposed various tasks for three languages; among them, we focused on Named Entity Recognition in French and German texts. The best system we proposed ranked third for these two languages; it uses FastText embeddings and ELMo language models (FrELMo and German ELMo). We show that combining several word representations enhances the quality of the results for all NE types and that sentence segmentation has an important impact on the results.
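    One common way to combine several word representations, such as FastText and ELMo here, is to concatenate the per-token vectors before feeding them to the tagger. Whether the SinNer system concatenates or mixes them differently is not stated in this abstract, so the following is only a sketch of the general idea:

```python
def combine_representations(static_vec, contextual_vec):
    """Concatenate a static (e.g. FastText) and a contextual (e.g. ELMo)
    vector for one token into a single input representation."""
    return list(static_vec) + list(contextual_vec)

ft = [0.0] * 300    # FastText vectors are typically 300-dimensional
el = [0.0] * 1024   # ELMo output vectors are typically 1024-dimensional
combined = combine_representations(ft, el)
```

    The downstream model then consumes the combined vector, so each representation can contribute complementary information (static subword features vs. contextual disambiguation).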

    Establishing a New State-of-the-Art for French Named Entity Recognition

    Published at LREC 2020 (the 12th edition was cancelled due to the COVID-19 pandemic; the proceedings are available at http://www.lrec-conf.org/proceedings/lrec2020/index.html).

    French Contextualized Word-Embeddings with a sip of CaBeRnet: a New French Balanced Reference Corpus

    This paper describes and compares the impact of different types and sizes of training corpora on language models such as ELMo. By asking the fundamental question of quality versus quantity, we evaluate four French training corpora on downstream parsing, POS tagging and named-entity recognition tasks. The paper studies the relevance of a new corpus, CaBeRnet, featuring a representative range of language usage, including a balanced variety of genres (oral transcriptions, newspapers, popular magazines, technical reports, fiction, academic texts), in oral and written styles. We hypothesize that a linguistically representative and balanced corpus will allow the language model to be more efficient and representative of a given language and therefore yield better evaluation scores on different evaluation sets and tasks.
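    Corpus balancing of the kind CaBeRnet aims for can be pictured as drawing comparable amounts of text from each genre. A toy sketch (the genre names and sampling procedure are our illustration, not the paper's actual construction method):

```python
import random

def balanced_sample(docs_by_genre, per_genre, seed=0):
    """Draw the same number of documents from every genre bucket."""
    rng = random.Random(seed)
    sample = []
    for genre in sorted(docs_by_genre):  # sorted for reproducibility
        sample.extend(rng.sample(docs_by_genre[genre], per_genre))
    return sample

docs_by_genre = {
    "news":    [f"news_{i}" for i in range(10)],
    "fiction": [f"fic_{i}" for i in range(10)],
    "oral":    [f"oral_{i}" for i in range(10)],
}
corpus = balanced_sample(docs_by_genre, per_genre=3)
```

    An unbalanced web crawl would instead be dominated by whichever genres happen to be most plentiful online, which is the contrast the paper's quality-versus-quantity question probes.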

    CamemBERT: a Tasty French Language Model

    Pretrained language models are now ubiquitous in Natural Language Processing. Despite their success, most available models have either been trained on English data or on the concatenation of data in multiple languages, which makes practical use of such models in all languages other than English very limited. In this paper, we investigate the feasibility of training monolingual Transformer-based language models for other languages, taking French as an example and evaluating our language models on part-of-speech tagging, dependency parsing, named entity recognition and natural language inference tasks. We show that the use of web-crawled data is preferable to the use of Wikipedia data. More surprisingly, we show that a relatively small web-crawled dataset (4GB) leads to results that are as good as those obtained using larger datasets (130+GB). Our best-performing model, CamemBERT, reaches or improves the state of the art in all four downstream tasks. (ACL 2020 long paper. Web site: https://camembert-model.f)